Abstract

The present paper is devoted to foundations of p-adic modelling in genomics.
Considering nucleotides, codons, DNA and RNA sequences, amino acids, and
proteins as information systems, we have formulated the corresponding p-adic
formalisms for their investigations. Each of these systems has its
characteristic prime number used for construction of the related information
space. Relevance of this approach is illustrated by some examples. In
particular, it is shown that degeneration of the genetic code is a p-adic
phenomenon. We have also put forward a hypothesis on evolution of the genetic
code assuming that primitive code was based on single nucleotides and
chronologically first four amino acids. This formalism of p-adic genomic
information systems can be implemented in computer programs and applied to
various concrete cases.



1 Introduction 

Living organisms seem to be the most complex, interesting and significant objects
among all substructures of the universe. One of the essential characteristics
that distinguish a living organism from all other material systems is related to
its genome. The genome of an organism is its whole hereditary information encoded
in the deoxyribonucleic acid (DNA), and contains both genes and non-coding
sequences. In some viruses, which are between living and non-living objects,
genetic material is encoded in the ribonucleic acid (RNA). Investigation of the
entire genome is the subject of genomics. The human genome, which presents
all genetic information of Homo sapiens, is composed of more than $3 \cdot 10^9$
DNA base pairs and contains more than $3 \cdot 10^4$ genes [1]. For more detailed
and comprehensive information on molecular biology aspects of DNA, RNA and the
genetic code one can use Ref. [2]. To have a self-contained exposition we shall
briefly review some necessary basic properties of genomics.

DNA is a macromolecule composed of two polynucleotide chains with a
double-helical structure. Nucleotides consist of a base, a sugar and a phosphate
group. The helical backbone is formed by the sugar and phosphate groups. There
are four bases and they are the building blocks of the genetic information. They
are called adenine (A), guanine (G), cytosine (C) and thymine (T). Adenine and
guanine are derived from purine, while cytosine and thymine are derived from
pyrimidine. In the sense of information, the nucleotide and its base present the
same object. Nucleotides are arranged along the chains of the double helix
through base pairs A-T and C-G, bonded by 2 and 3 hydrogen bonds, respectively.
As a consequence of this pairing there is an equal number of cytosines and
guanines, as well as an equal number of adenines and thymines. DNA is packaged in
chromosomes, which are localized in the nucleus of eukaryotic cells.

The main role of DNA is to store genetic information, and there are two
main processes that exploit this information. The first one is replication, in
which DNA duplicates, giving two new DNA molecules containing the same
information as the original one. This is possible owing to the fact that each of
the two chains contains complementary bases of the other one. The second process
is related to gene expression, i.e. the passage of DNA gene information to
proteins. It is performed by the messenger ribonucleic acid (mRNA), which is
usually a single polynucleotide chain. The mRNA is synthesized during the first
part of this process, known as transcription, when nucleotides C, A, T, G from
DNA are respectively transcribed into their complements G, U, A, C in mRNA, where
T is replaced by U (U is uracil, which is a pyrimidine). The next step in gene
expression is translation, when the information coded by codons in the mRNA is
translated into proteins. Transfer RNA (tRNA) and ribosomal RNA (rRNA) also
participate in this process.
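
The transcription rule just described can be sketched in a few lines of Python. The function name `transcribe` and the example strand are illustrative assumptions and do not come from the original text; only the base-by-base complement rule C, A, T, G -> G, U, A, C is taken from it.

```python
# Complement rule stated in the text: DNA C, A, T, G -> mRNA G, U, A, C.
DNA_TO_MRNA = {"C": "G", "A": "U", "T": "A", "G": "C"}

def transcribe(dna_template: str) -> str:
    """Return the mRNA complement of a DNA template strand, base by base."""
    return "".join(DNA_TO_MRNA[base] for base in dna_template)

print(transcribe("CATG"))  # GUAC
```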

Codons are ordered trinucleotides composed of C, A, U (T) and G. Each of 
them presents an information which controls use of one of the 20 standard amino 
acids or stop signal in synthesis of proteins. 

Protein synthesis in all eukaryotic cells is performed in the ribosomes of the
cytoplasm. Proteins [2] are organic macromolecules composed of amino acids
arranged in a linear chain. Amino acids are molecules that consist of amino,
carboxyl and R (side chain) groups. Depending on the R group there are 20
standard amino acids. These amino acids are joined together by peptide bonds.
Proteins are substantial ingredients of all living organisms, participating in
various processes in cells and determining the phenotype of an organism. In the
human body there may be about 2 million different proteins. The study of
proteins, especially their structure and functions, is called proteomics. The
proteome is the entire set of proteins in an organism.

The sequence of amino acids in a protein is determined by the sequence of codons
contained in DNA genes. The relation between codons and amino acids is known
as the genetic code. Although there are at least 16 codes (see, e.g. [3]), the
two most important are the standard (eukaryotic) code and the vertebrate
mitochondrial code.

In the sequel we shall mainly have in mind the vertebrate mitochondrial code,
because it is a simple one and the others may be regarded as its slight
modifications. It is obvious that there are 4 x 4 x 4 = 64 codons. However (in
the vertebrate mitochondrial code), 60 of them are distributed over the 20
different amino acids and 4 are stop codons, which serve as termination signals.
According to experimental observations, two amino acids are coded by six codons,
six amino acids by four codons, and twelve amino acids by two codons
(2 x 6 + 6 x 4 + 12 x 2 = 60). This property that some amino acids are coded by
more than one codon is known as genetic code degeneracy. This degeneracy is a
very important property of the genetic code and gives an efficient way to
minimize errors caused by mutations.

Since there is in principle a huge number (between $10^{71}$ and $10^{84}$) of all
possible assignments between codons and amino acids, and only a very small
number of them is represented in living cells, it has been a persistent theoretical
challenge to find an appropriate model explaining contemporary genetic codes.
An interesting model based on the quantum algebra $U_q(sl(2) \oplus sl(2))$ in the
$q \to 0$ limit was proposed as a symmetry algebra for the genetic code (see [3]
and references therein). In a sense this approach mimics the quark model of
baryons. To describe the correspondence between codons and amino acids, an
operator was constructed which acts on the space of codons and whose eigenvalues
are related to amino acids. Besides some successes of this approach, there is a
problem with the rather large number of parameters in the operator. There are
also papers (see, e.g. [4], [5] and [6]) starting with a 64-dimensional
irreducible representation of a Lie (super)algebra and trying to connect the
multiplicity of codons with irreducible representations of subalgebras arising
in a chain of symmetry breaking. Although interesting as an attempt to describe
evolution of the genetic code, these Lie algebra approaches did not succeed in
reproducing its modern form. For a very brief review of these and some other
theoretical approaches to the genetic code one can see Ref. [3]. There is still
no generally accepted explanation of the genetic code.

It is worth recalling the emergence of the special theory of relativity and
quantum mechanics. They both appeared as a result of unsatisfactory attempts to
extend classical concepts to new physical phenomena, and of the introduction of
new physical ideas and mathematical methods. Although far from everyday
experience, these two new theories describe physical reality quite successfully.
We believe that a similar situation should happen in the theoretical description
of living processes in biological organisms. Ultrametric and p-adic methods seem
to be very promising tools in further investigation of life.

Modelling of the genome, the genetic code and proteins is a challenge as well
as an opportunity for applications of p-adic mathematical physics. Recently [7]
we introduced a p-adic approach to DNA and RNA sequences, and to the genetic
code. The central point of our approach is an appropriate identification of the
four nucleotides with digits 1, 2, 3, 4 of 5-adic integer expansions and
application of p-adic distances between the obtained numbers. 5-Adic numbers with
three digits form 64 integers which correspond to 64 codons. In [8] we analyzed
p-adic degeneracy of the genetic code. One of the main results we have obtained
is an explanation of the structure of the genetic code degeneracy using the
p-adic distance between codons. A similar approach to the genetic code was
considered on the dyadic plane [9].

Let us mention that p-adic models in mathematical physics have been actively
considered since 1987 (see [10], [11] for early reviews and [12], [13] for some
recent reviews). It is worth noting that p-adic models with pseudodifferential
operators have been successfully applied to interbasin kinetics of proteins [14].
Some p-adic aspects of cognitive, psychological and social phenomena have also
been considered [15]. The recent application of p-adic numbers in physics and
related branches of science is reflected in the proceedings of the 2nd
International Conference on p-Adic Mathematical Physics [16].

This paper is devoted to the further p-adic modelling of the genome as well
as to p-adic roots of the genetic code evolution, based on the approach
introduced in [7] and considered in [8].



2 p-Adic Genome 

In Introduction we presented a brief review of the genome and the genetic code, 
as well as some motivations for their p-adic theoretical investigations. To consider 
p-adic properties of the genome and the genetic code in a self-contained way we 
shall also recall some mathematical preliminaries. 

2.1 Some mathematical preliminaries and p-adic codon space

As a new tool to study Diophantine equations, p-adic numbers were introduced
by the German mathematician Kurt Hensel in 1897. They are involved in many
branches of modern mathematics, either as rapidly developing topics or as suitable
applications. An elementary introduction to p-adic numbers can be found in the
book [17]. However, for our purposes we will use here only a small portion of
p-adics, mainly some finite sets of integers and ultrametric distances between
them.

Let us introduce the set of natural numbers

$$ C_5[64] = \{ n_0 + n_1\, 5 + n_2\, 5^2 \, : \, n_i = 1, 2, 3, 4 \} , \qquad (1) $$

where $n_i$ are digits related to nucleotides by the following assignments:
C (cytosine) = 1, A (adenine) = 2, T (thymine) = U (uracil) = 3, G (guanine) = 4.
This is a finite expansion to the base 5. It is obvious that 5 is a prime number
and that the set $C_5[64]$ contains 64 numbers between 31 and 124 in the usual
base 10. In the sequel we shall often denote elements of $C_5[64]$ by their digits
to the base 5 in the following way: $n_0 + n_1\, 5 + n_2\, 5^2 \equiv n_0 n_1 n_2$.
Note that here the ordering of digits is the same as in the expansion, i.e. this
ordering is opposite to the usual one. There is now an evident one-to-one
correspondence between codons in three-letter notation and the number
$n_0 n_1 n_2$ representation.
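
As a minimal sketch of this construction, the following Python fragment builds the 64 integers of the codon set from the digit assignment C = 1, A = 2, U = 3, G = 4 and checks the stated range. The names `DIGIT`, `codon_to_int` and `codon_space` are hypothetical helpers introduced only for illustration.

```python
from itertools import product

# Digit assignment from the text: C = 1, A = 2, U (or T) = 3, G = 4.
DIGIT = {"C": 1, "A": 2, "U": 3, "T": 3, "G": 4}

def codon_to_int(codon: str) -> int:
    """Map a codon n0 n1 n2 to the integer n0 + n1*5 + n2*5**2."""
    n0, n1, n2 = (DIGIT[base] for base in codon)
    return n0 + n1 * 5 + n2 * 25

# All 64 RNA codons with their integer values.
codon_space = {"".join(c): codon_to_int("".join(c)) for c in product("CAUG", repeat=3)}

print(len(set(codon_space.values())))                        # 64
print(min(codon_space.values()), max(codon_space.values()))  # 31 124
print(codon_to_int("AUG"))                                   # 2 + 3*5 + 4*25 = 117
```

The first digit multiplies the lowest power of 5, which is why the digit string is read in the opposite direction to ordinary positional notation.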

It is also often important to know a distance between numbers. Distance can
be defined by a norm. On the set $\mathbb{Z}$ of integers there are two kinds of
nontrivial norm: the usual absolute value $|\cdot|_\infty$ and the p-adic absolute
value $|\cdot|_p$, where p is any prime number. The usual absolute value is well
known from elementary mathematics and the corresponding distance between two
numbers x and y is $d_\infty(x, y) = |x - y|_\infty$.

The p-adic absolute value is related to the divisibility of integers by prime
numbers. The difference of two integers is again an integer. The p-adic distance
between two integers can be understood as a measure of the divisibility of their
difference by p (the more divisible, the shorter). By definition, the p-adic norm
of an integer $m \in \mathbb{Z}$ is $|m|_p = p^{-k}$, where
$k \in \mathbb{N} \cup \{0\}$ is the degree of divisibility of m by the prime p
(i.e. $m = p^k m'$, $p \nmid m'$), and $|0|_p = 0$. This norm is a mapping from
$\mathbb{Z}$ into the non-negative rational numbers and has the following
properties:

(i) $|x|_p \geq 0$, and $|x|_p = 0$ if and only if $x = 0$,

(ii) $|x y|_p = |x|_p \, |y|_p$,

(iii) $|x + y|_p \leq \max\{|x|_p , |y|_p\} \leq |x|_p + |y|_p$ for all $x, y \in \mathbb{Z}$.

Because of the strong triangle inequality $|x + y|_p \leq \max\{|x|_p , |y|_p\}$,
the p-adic absolute value is a non-Archimedean (ultrametric) norm. One can easily
conclude that $0 \leq |m|_p \leq 1$ for any $m \in \mathbb{Z}$ and any prime p.

The p-adic distance between two integers x and y is

$$ d_p(x, y) = |x - y|_p . \qquad (2) $$

Since the p-adic absolute value is ultrametric, the p-adic distance (2) is also
ultrametric, i.e. it satisfies

$$ d_p(x, y) \leq \max\{d_p(x, z) , d_p(z, y)\} \leq d_p(x, z) + d_p(z, y) , \qquad (3) $$

where x, y and z are any three integers.
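
The definitions of the p-adic norm and distance on integers translate directly into code. The sketch below is only an illustration under the definitions above; the function names `padic_norm` and `padic_distance` are assumptions, and exact rational arithmetic is used so that values such as 1/25 are not rounded.

```python
from fractions import Fraction

def padic_norm(m: int, p: int) -> Fraction:
    """|m|_p = p**(-k), where p**k is the highest power of p dividing m; |0|_p = 0."""
    if m == 0:
        return Fraction(0)
    k = 0
    while m % p == 0:
        m //= p
        k += 1
    return Fraction(1, p**k)

def padic_distance(x: int, y: int, p: int) -> Fraction:
    """d_p(x, y) = |x - y|_p."""
    return padic_norm(x - y, p)

print(padic_norm(50, 5))  # 1/25, since 50 = 5**2 * 2
print(padic_norm(7, 5))   # 1, since 5 does not divide 7

# Brute-force check of the strong triangle inequality on a small range of integers.
rng = range(-20, 21)
ultrametric = all(
    padic_distance(x, y, 5) <= max(padic_distance(x, z, 5), padic_distance(z, y, 5))
    for x in rng for y in rng for z in rng
)
print(ultrametric)  # True
```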

The set $C_5[64]$ introduced above, endowed with the p-adic distance, we shall
call p-adic codon space, i.e. elements of $C_5[64]$ are codons denoted by
$n_0 n_1 n_2$. The 5-adic distance between two codons $a, b \in C_5[64]$ is

$$ d_5(a, b) = |a_0 + a_1\, 5 + a_2\, 5^2 - b_0 - b_1\, 5 - b_2\, 5^2|_5 , \qquad (4) $$

where $a_i, b_i \in \{1, 2, 3, 4\}$. When $a \neq b$ then $d_5(a, b)$ may have three
different values:

• $d_5(a, b) = 1$ if $a_0 \neq b_0$,

• $d_5(a, b) = 1/5$ if $a_0 = b_0$ and $a_1 \neq b_1$,

• $d_5(a, b) = 1/5^2$ if $a_0 = b_0$, $a_1 = b_1$ and $a_2 \neq b_2$.

We see that the largest 5-adic distance between codons is 1 and it is the maximum
p-adic distance on $\mathbb{Z}$. The smallest 5-adic distance on the codon space
is $5^{-2}$. Let us also note that the 5-adic distance depends only on the first
two nucleotides of different codons.
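
The three cases can be checked by enumerating all pairs of distinct codons. This is a small sketch under the digit conventions above; `norm5`, `value` and `d5` are hypothetical helper names, not taken from the original.

```python
from fractions import Fraction
from itertools import product

def norm5(m: int) -> Fraction:
    """5-adic norm of an integer m."""
    if m == 0:
        return Fraction(0)
    k = 0
    while m % 5 == 0:
        m //= 5
        k += 1
    return Fraction(1, 5**k)

codons = list(product((1, 2, 3, 4), repeat=3))   # digit triples (n0, n1, n2)

def value(c):
    return c[0] + 5 * c[1] + 25 * c[2]

def d5(a, b):
    return norm5(value(a) - value(b))

def expected(a, b):
    """Distance predicted by the first position at which the codons differ."""
    if a[0] != b[0]:
        return Fraction(1)
    if a[1] != b[1]:
        return Fraction(1, 5)
    return Fraction(1, 25)

print(sorted({d5(a, b) for a in codons for b in codons if a != b}))
# [Fraction(1, 25), Fraction(1, 5), Fraction(1, 1)]
print(all(d5(a, b) == expected(a, b) for a in codons for b in codons if a != b))  # True
```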

If we apply the real (standard) distance
$d_\infty(a, b) = |a_0 + a_1\, 5 + a_2\, 5^2 - b_0 - b_1\, 5 - b_2\, 5^2|_\infty$,
then the third nucleotides $a_2$ and $b_2$ would play a more important role than
those at the second position (i.e. $a_1$ and $b_1$), and nucleotides $a_0$ and
$b_0$ are of the smallest importance. In the real $C_5[64]$ space distances are
also discrete, but take values 1, 2, ..., 93. The smallest real and the largest
5-adic distance are both equal to 1. While the real distance describes the metric
of the ordinary physical space, this p-adic one serves to describe ultrametricity
of the codon space.
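
For comparison, the real distances on the same 64 integers can be enumerated as follows. This is an illustrative sketch only; the variable names are assumptions.

```python
from itertools import product

# The 64 codon integers n0 + 5*n1 + 25*n2 with digits 1..4.
values = sorted(n0 + 5 * n1 + 25 * n2 for n0, n1, n2 in product(range(1, 5), repeat=3))

# Ordinary (real) distances between distinct codon integers.
real_distances = {abs(x - y) for x in values for y in values if x != y}

print(min(real_distances), max(real_distances))  # 1 93
print(real_distances == set(range(1, 94)))       # True
```

Changing only the third digit shifts the integer by a multiple of 25, which is why the third nucleotide dominates the real distance while it contributes least to the 5-adic one.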

It is worth emphasizing that the metric role of digits depends on their position 
in number expansion and it is quite opposite in real and p-adic cases. We shall 
see later, when we consider the genetic code, that the first two nucleotides in a 
codon are more important than the third one and that p-adic distance between 
codons is a natural one in description of their information content (the closer, the 
more similar). 

2.2 p-Adic genomic and bioinformation spaces

Appropriateness of the p-adic codon space $C_5[64]$ to the genetic code was
already shown in [8] and will be reconsidered in Section 3. Now we want to extend
the $C_5[64]$ space approach to more general genetic and bioinformation spaces.

Let us recall that the four nucleotides are related to the prime number 5 by their
correspondence to the four nonzero digits (1, 2, 3, 4) of p = 5. It is
inappropriate to use the digit 0 for a nucleotide because it leads to
non-uniqueness in the representation of codons by natural numbers. For example,
123 = 123000 as numbers, but 123 represents one and 123000 two codons. This is
also a reason why we do not use a 4-adic representation for codons, since it
would contain a nucleotide presented by digit 0. One can use 0 as a digit to
denote absence of any nucleotide.
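
The non-uniqueness caused by a digit 0 is easy to see numerically. In the sketch below, `digits_to_int` is a hypothetical helper that evaluates a digit string with the lowest power first, as in the text.

```python
def digits_to_int(digits, p=5):
    """Value of n0 n1 ... n_{m-1} read as n0 + n1*p + ... + n_{m-1}*p**(m-1)."""
    return sum(n * p**i for i, n in enumerate(digits))

one_codon = [1, 2, 3]
two_codons = [1, 2, 3, 0, 0, 0]   # a second "codon" built from the digit 0

# Both digit strings give the same integer, so the codon count would be ambiguous.
print(digits_to_int(one_codon), digits_to_int(two_codons))  # 86 86
```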

Let us note also that on $C_5[64]$ we have used, in [7] and [8], not only the
5-adic but the 2-adic distance as well.

Definition 1. We shall call (p, q)-adic genomic space a pair
$\big( \Gamma_p[(p-1)^m] , d_q \big)$, where

$$ \Gamma_p[(p-1)^m] = \{ n_0 + n_1\, p + \cdots + n_{m-1}\, p^{m-1} \, : \, n_i = 1, 2, \ldots, p-1 , \ m \in \mathbb{N} \} \qquad (5) $$

is the set of natural numbers, $d_q$ is the corresponding q-adic distance on
$\Gamma_p[(p-1)^m]$, and the nonzero digits $n_i$ are related to some p - 1 basic
constituents of a genomic system (or of any other biological information system)
in a unique way. The index q is a prime number.

Here m can also be called the multiplicity of space elements with respect to
their constituents. In addition to $d_p$ there can be a few other useful
distances $d_q$ on $\Gamma_p[(p-1)^m]$.

For simplicity, we shall often call $\Gamma_p[(p-1)^m]$ p-adic genomic space and
use the notation

$$ n_0 + n_1\, p + \cdots + n_{m-1}\, p^{m-1} \equiv n_0 n_1 \cdots n_{m-1} , \qquad (6) $$

where the ordering of digits is in the opposite direction to the standard one and
seems here more natural. The codon space $C_5[64]$ introduced earlier can be
regarded as a significant example of the p-adic genomic spaces, i.e.
$C_5[64] = \Gamma_5[(5-1)^3]$ as the space of trinucleotides. Two other examples,
which will be used later, are: $\Gamma_5[4]$, the space of nucleotides, and
$\Gamma_5[4^2]$, the space of dinucleotides.
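
A minimal sketch of Definition 1 as a generator of integers is given below. The function name `genomic_space` is an assumption made for illustration; it enumerates all words of m nonzero base-p digits and returns their integer values.

```python
from itertools import product

def genomic_space(p: int, m: int) -> list:
    """Integers n0 + n1*p + ... + n_{m-1}*p**(m-1) with all digits in 1..p-1."""
    digits = range(1, p)
    return sorted(
        sum(n * p**i for i, n in enumerate(word))
        for word in product(digits, repeat=m)
    )

# Nucleotides, dinucleotides and trinucleotides (codons) for p = 5.
for m in (1, 2, 3):
    space = genomic_space(5, m)
    print(m, len(space), len(set(space)))
# 1 4 4
# 2 16 16
# 3 64 64
```

Each space contains $(p-1)^m$ distinct integers, which is the uniqueness property that motivates excluding the digit 0.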

Now we can introduce p-adic bioinformation space as a space composed of
some p-adic genomic spaces.

Definition 2. Let $B_p[N]$ be a p-adic space composed of N natural numbers.
We shall call $B_p[N]$ p-adic bioinformation space when it can be presented as

$$ B_p[N] \subset \prod_{m=m_1}^{m_2} \Gamma_p[(p-1)^m] , \qquad (7) $$

where $m_1$ and $m_2$ are positive integers ($m_1 \leq m_2$), which determine the
range of multiplicity between $m_1$ and $m_2$. In the sequel we shall present some
concrete examples of the $B_p[N]$ spaces.

2.3 DNA and RNA spaces 

DNA sequences can be considered as a union of coding and non-coding segments.
Coding parts are composed of codons included in genes, which are rather complex
systems. Coding segments store information which, through a series of complex
processes, is translated into proteins. The space of coding DNA sequences
(cDNA) can be presented as

$$ cDNA[N] \subset \prod_{m=m_1}^{m_2} \Gamma_{61}[60^m] , \qquad (8) $$

where p = 61 because there are 60 codons coding amino acids (in the vertebrate
mitochondrial code). Thus the cDNA space can be regarded as a set of N coded
sequences as well as a set of discrete points (a lattice) of the
$\prod_{m=m_1}^{m_2} \Gamma_{61}[60^m]$ space. While in $C_5[64]$ codons are space
elements, in $cDNA[N]$ they are building units.

The structure and function of non-coding sequences are still largely unknown.
They include information on various regulatory processes in the cell. We assume
that the space of non-coding DNA sequences (ncDNA) is a subspace

$$ ncDNA \subset \prod_{m=m_1}^{m_2} \Gamma_5[4^m] , \qquad (9) $$

where $m_1$ and $m_2$ are the minimum and maximum values of the size of non-coding
segments.

In a similar way one can construct a space of all RNA sequences in the cell. 

2.4 Protein space

We mentioned some basic properties of proteins in the Introduction. Recall also
that the functional properties of proteins depend on their three-dimensional
structure. There are four distinct levels of protein structure (primary,
secondary, tertiary and quaternary) [2]. The primary structure is determined by
the amino acid sequence and the other ones depend on the side chains of amino
acids (see Table 1). In addition to the 20 standard amino acids presented in
Table 1, there are also 2 special nonstandard amino acids: selenocysteine and
pyrrolysine [18]. They are also coded by codons, but are very rare in proteins.
Thus there are 22 amino acids encoded in the genetic code. According to Jukes
[19], a non-freezing code may contain 28 amino acids.

The 20 standard (canonical) amino acids employed by the genetic code in
proteins of living cells are listed in Table 1. Some of their important chemical
properties are presented in Table 2.

Now we want to construct an appropriate space whose elements are proteins.
We propose the protein space $\mathcal{P}_p$ to be a subspace of a product of
genomic spaces

$$ \mathcal{P}_p[N] \subset \prod_{m=m_1}^{m_2} \Gamma_p[(p-1)^m] , \qquad (10) $$

where the building units are amino acids. Thus $\mathcal{P}_p[N]$ is a space of
proteins with size measured by the number of amino acids between $m_1$ and $m_2$
($m_1 \sim 10$ and $m_2 \sim 10^4$).

In (10) the prime number p is related to the number of amino acids by the
relation: p - 1 = number of different amino acids used as building blocks in
proteins. At the present time there are 22 amino acids (20 standard and 2
special) and consequently p = 23. One can argue that not all 22 amino acids have
been present from the very beginning of life and that there has been an evolution
of amino acids. Namely, using 60 different criteria for the temporal order of
appearance of the 20 standard amino acids, the obtained result [21] is presented
in Table 3. The first four amino acids (Gly, Ala, Asp and Val) have the highest
production rate in Miller's experiment imitating the atmosphere of the early
Earth. This could correspond to p = 5 and single nucleotide codons in a primitive
code. In the case of a dinucleotide code there are 16 codons and the maximum
number of amino acids that can be coded is 16, i.e. p = 17. As we already
mentioned, according to Jukes [19], it is possible to code 28 amino acids by a
trinucleotide code, which gives the corresponding p = 29.
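
The relation p - 1 = number of constituents can be tabulated for the cases mentioned in the text, together with a check that each resulting p is indeed prime. This is only an illustrative sketch; `is_prime` and the labels in `cases` are assumptions introduced here.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n**0.5) + 1))

# Number of constituents quoted in the text for each information system.
cases = {
    "nucleotides / first four amino acids": 4,
    "dinucleotide code": 16,
    "present amino acids (20 standard + 2 special)": 22,
    "Jukes's trinucleotide estimate": 28,
    "codons coding amino acids (cDNA)": 60,
}

for label, count in cases.items():
    p = count + 1
    print(f"{label}: p = {p}, prime: {is_prime(p)}")
```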

3 p-Adic Genetic Code

An intensive study of the connection between the ordering of nucleotides in DNA
(and RNA) and the ordering of amino acids in proteins led to the experimental
deciphering of the standard genetic code in the mid-1960s. The genetic code is
understood as a dictionary for translation of information from DNA (through RNA)
to synthesis of proteins by amino acids. The information on amino acids is
contained in codons: each codon codes either an amino acid or a termination
signal (see, e.g. Table 4 as a standard table of the vertebrate mitochondrial
code). To the sequence of codons in RNA corresponds a quite definite sequence of
amino acids in a protein, and this sequence of amino acids determines the primary
structure of the protein. The genetic code is comma-free and non-overlapping. At
the time of deciphering, it was mainly believed that the standard code is unique,
the result of chance, and fixed a long time ago. Crick [22] expressed such a
belief in his "frozen accident" hypothesis, which has not been supported by later
observations. Moreover, at least 16 different codes have been discovered so far
and some general regularities have been found. At first glance the genetic code
looks rather arbitrary, but it is not. Namely, mutations between synonymous
codons give the same amino acid. When a mutation alters an amino acid, it is
usually like a substitution of the original by a similar one. In this respect the
code is almost optimal.



Despite remarkable experimental successes, there is no simple and generally
accepted theoretical understanding of the genetic code. There are many papers in
this direction (in addition to those already cited, see also, e.g. [23] and [24]),
scattered in various journals, with theoretical approaches based more or less on
chemical, biological and mathematical aspects of the genetic code. Even before
the deciphering of the code there were very attractive theoretical inventions (of
Gamow and Crick), but the genetic code turned out to be quite different (for a
review of the early inventions around the genetic code, see [25]). However, the
foundation of biological coding is still an open problem. In particular, it is
not clear why the genetic code exists just in a few known ways and not in many
other possible ones. What is the principle (or principles) employed in the
establishment of a basic (mitochondrial) code? What are the properties of codons
connecting them into definite multiplets which code the same amino acid or
termination signal? Answers to these and some other questions should lead us to
discover an appropriate theoretical model of the genetic code.

TABLE 1. List of 20 standard amino acids used in proteins by living cells.
3-Letter and 1-letter abbreviations, and chemical structure of their side chains
are presented.

AMINO ACID       ABBR.     SIDE CHAIN (R)

Alanine          Ala, A    -CH3
Cysteine         Cys, C    -CH2SH
Aspartate        Asp, D    -CH2COOH
Glutamate        Glu, E    -(CH2)2COOH
Phenylalanine    Phe, F    -CH2C6H5
Glycine          Gly, G    -H
Histidine        His, H    -CH2-C3H3N2
Isoleucine       Ile, I    -CH(CH3)CH2CH3
Lysine           Lys, K    -(CH2)4NH2
Leucine          Leu, L    -CH2CH(CH3)2
Methionine       Met, M    -(CH2)2SCH3
Asparagine       Asn, N    -CH2CONH2
Proline          Pro, P    -(CH2)3-
Glutamine        Gln, Q    -(CH2)2CONH2
Arginine         Arg, R    -(CH2)3NHC(NH)NH2
Serine           Ser, S    -CH2OH
Threonine        Thr, T    -CH(OH)CH3
Valine           Val, V    -CH(CH3)2
Tryptophan       Trp, W    -CH2C8H6N
Tyrosine         Tyr, Y    -CH2-C6H4OH

TABLE 2. Some chemical properties of 20 standard amino acids. p + n is the
number of nucleons. These and some other recent chemical values can be found
in [20].

AMINO ACID    p+n    POLAR    HYDROPHOBIC    IN PROTEINS (%)

Ala, A         89    no       yes            7.8
Cys, C        121    no       yes            1.9
Asp, D        133    yes      no             5.3
Glu, E        147    yes      no             6.3
Phe, F        165    no       yes            3.9
Gly, G         75    no       yes            7.2
His, H        155    yes      no             2.3
Ile, I        131    no       yes            5.3
Lys, K        146    yes      no             5.9
Leu, L        131    no       yes            9.1
Met, M        149    no       yes            2.3
Asn, N        132    yes      no             4.3
Pro, P        115    no       yes            5.2
Gln, Q        146    yes      no             4.2
Arg, R        174    yes      no             5.1
Ser, S        105    yes      no             6.8
Thr, T        119    yes      no             5.9
Val, V        117    no       yes            6.6
Trp, W        204    no       yes            1.4
Tyr, Y        181    yes      yes            3.2


TABLE 3. Temporal appearance of the 20 standard amino acids [21].

(1) Gly    (2) Ala    (3) Asp    (4) Val

(5) Pro    (6) Ser    (7) Glu    (8) Leu

(9) Thr    (10) Arg   (11) Ile   (12) Gln

(13) Asn   (14) His   (15) Lys   (16) Cys

(17) Phe   (18) Tyr   (19) Met   (20) Trp

TABLE 4. The standard (Watson-Crick) table of the vertebrate mitochondrial
code. Ter denotes the terminal (stop) signal.

UUU Phe    UCU Ser    UAU Tyr    UGU Cys
UUC Phe    UCC Ser    UAC Tyr    UGC Cys
UUA Leu    UCA Ser    UAA Ter    UGA Trp
UUG Leu    UCG Ser    UAG Ter    UGG Trp

CUU Leu    CCU Pro    CAU His    CGU Arg
CUC Leu    CCC Pro    CAC His    CGC Arg
CUA Leu    CCA Pro    CAA Gln    CGA Arg
CUG Leu    CCG Pro    CAG Gln    CGG Arg

AUU Ile    ACU Thr    AAU Asn    AGU Ser
AUC Ile    ACC Thr    AAC Asn    AGC Ser
AUA Met    ACA Thr    AAA Lys    AGA Ter
AUG Met    ACG Thr    AAG Lys    AGG Ter

GUU Val    GCU Ala    GAU Asp    GGU Gly
GUC Val    GCC Ala    GAC Asp    GGC Gly
GUA Val    GCA Ala    GAA Glu    GGA Gly
GUG Val    GCG Ala    GAG Glu    GGG Gly

TABLE 5. Our table of the vertebrate mitochondrial code in the usual notation.

CCC Pro    ACC Thr    UCC Ser    GCC Ala
CCA Pro    ACA Thr    UCA Ser    GCA Ala
CCU Pro    ACU Thr    UCU Ser    GCU Ala
CCG Pro    ACG Thr    UCG Ser    GCG Ala

CAC His    AAC Asn    UAC Tyr    GAC Asp
CAA Gln    AAA Lys    UAA Ter    GAA Glu
CAU His    AAU Asn    UAU Tyr    GAU Asp
CAG Gln    AAG Lys    UAG Ter    GAG Glu

CUC Leu    AUC Ile    UUC Phe    GUC Val
CUA Leu    AUA Met    UUA Leu    GUA Val
CUU Leu    AUU Ile    UUU Phe    GUU Val
CUG Leu    AUG Met    UUG Leu    GUG Val

CGC Arg    AGC Ser    UGC Cys    GGC Gly
CGA Arg    AGA Ter    UGA Trp    GGA Gly
CGU Arg    AGU Ser    UGU Cys    GGU Gly
CGG Arg    AGG Ter    UGG Trp    GGG Gly

TABLE 6. Our 5-adic table of the vertebrate mitochondrial code, which is a
representation of the $C_5[64]$ codon space.

111 Pro    211 Thr    311 Ser    411 Ala
112 Pro    212 Thr    312 Ser    412 Ala
113 Pro    213 Thr    313 Ser    413 Ala
114 Pro    214 Thr    314 Ser    414 Ala

121 His    221 Asn    321 Tyr    421 Asp
122 Gln    222 Lys    322 Ter    422 Glu
123 His    223 Asn    323 Tyr    423 Asp
124 Gln    224 Lys    324 Ter    424 Glu

131 Leu    231 Ile    331 Phe    431 Val
132 Leu    232 Met    332 Leu    432 Val
133 Leu    233 Ile    333 Phe    433 Val
134 Leu    234 Met    334 Leu    434 Val

141 Arg    241 Ser    341 Cys    441 Gly
142 Arg    242 Ter    342 Trp    442 Gly
143 Arg    243 Ser    343 Cys    443 Gly
144 Arg    244 Ter    344 Trp    444 Gly

Let us now turn to Table 4. We observe that this table can be regarded as
a big rectangle divided into 16 equal smaller rectangles: 8 of them are
quadruplets which correspond one-to-one to 8 amino acids, and the other 8
rectangles are divided into 16 doublets coding 14 amino acids and the termination
(stop) signal (by two doublets at different places). However, there is no
manifest symmetry in the distribution of these quadruplets and doublets.

In order to get a symmetry we have rewritten this standard table into a new
one by rearranging the 16 rectangles. As a result we obtained Table 5, which
exhibits a symmetry with respect to the distribution of codon quadruplets and
codon doublets. Namely, in our table quadruplets and doublets form separately two
figures, which are symmetric with respect to the mid vertical line (a left-right
symmetry), i.e. they are invariant under the interchange C ↔ G and A ↔ U at the
first position in codons at all horizontal lines. Recall that DNA is also
symmetric under simultaneous interchange of complementary nucleotides C ↔ G and
A ↔ T between its strands. All doublets in this table form a nice figure which
looks like the letter T.

Table 6 contains the same distribution of amino acids as Table 5, but codons
are now presented by 5-adic numbers $n_0 n_1 n_2$ instead of capital letters
(recall: C = 1, A = 2, U = 3, G = 4). This new table can also be regarded as a
representation of the $C_5[64]$ codon space with gradual increase of integers from
111 to 444. The observed left-right symmetry is now invariance under the
corresponding transformations 1 ↔ 4 and 2 ↔ 3. In other words, at each horizontal
line one can perform doublet ↔ doublet and quadruplet ↔ quadruplet interchange
around the vertical midline.
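
The left-right symmetry can be verified mechanically from the code table. The sketch below encodes the 16 rectangles of the vertebrate mitochondrial code keyed by the first two nucleotides, with the four amino acids listed for the third nucleotide in the order C, A, U, G; the names `BOXES`, `SWAP` and `is_quadruplet` are hypothetical and introduced only for this check.

```python
# First two nucleotides -> amino acids for third nucleotide C, A, U, G.
BOXES = {
    "CC": "Pro Pro Pro Pro", "AC": "Thr Thr Thr Thr", "UC": "Ser Ser Ser Ser", "GC": "Ala Ala Ala Ala",
    "CA": "His Gln His Gln", "AA": "Asn Lys Asn Lys", "UA": "Tyr Ter Tyr Ter", "GA": "Asp Glu Asp Glu",
    "CU": "Leu Leu Leu Leu", "AU": "Ile Met Ile Met", "UU": "Phe Leu Phe Leu", "GU": "Val Val Val Val",
    "CG": "Arg Arg Arg Arg", "AG": "Ser Ter Ser Ter", "UG": "Cys Trp Cys Trp", "GG": "Gly Gly Gly Gly",
}

# Interchange C <-> G and A <-> U (digits 1 <-> 4 and 2 <-> 3) at the first position.
SWAP = {"C": "G", "G": "C", "A": "U", "U": "A"}

def is_quadruplet(box: str) -> bool:
    """True when all four codons of the rectangle code the same amino acid."""
    return len(set(BOXES[box].split())) == 1

symmetric = all(is_quadruplet(b) == is_quadruplet(SWAP[b[0]] + b[1]) for b in BOXES)
print(symmetric)                              # True
print(sum(is_quadruplet(b) for b in BOXES))   # 8
```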

It is worth noting that the above invariance also leaves unchanged the polarity
and hydrophobicity of the corresponding amino acids in all but three cases:
Asn ↔ Tyr, Arg ↔ Gly, and Ser ↔ Cys.

3.1 Degeneracy of the genetic code 

Let us now explore distances between codons and their role in formation of the 
genetic code degeneration. 

To this end let us again turn to Table 6 as a representation of the $C_5[64]$
codon space. Namely, we observe that there are 16 quadruplets such that each of
them has the same first two digits. Hence the 5-adic distance between any two
different codons within a quadruplet is

$$ d_5(a, b) = |a_0 + a_1\, 5 + a_2\, 5^2 - a_0 - a_1\, 5 - b_2\, 5^2|_5 = |(a_2 - b_2)\, 5^2|_5 = |a_2 - b_2|_5 \, |5^2|_5 = 5^{-2} , \qquad (11) $$

because $a_0 = b_0$, $a_1 = b_1$ and $|a_2 - b_2|_5 = 1$. According to (11), codons
within every quadruplet are at the smallest distance, i.e. they are closest to
each other compared with all other codons.

Since codons are composed of three arranged nucleotides, each of which is
either a purine or a pyrimidine, it is natural to try to quantify similarity
inside purines and pyrimidines, as well as distinction between elements from
these two groups of nucleotides. Fortunately there is a tool, which is again
related to the p-adics, and now it is the 2-adic distance. One can easily see
that the 2-adic distance between pyrimidines C and U is
$d_2(1, 3) = |3 - 1|_2 = 1/2$, as is the distance between purines A and G, namely
$d_2(2, 4) = |4 - 2|_2 = 1/2$. However, the 2-adic distance between C and A or G,
as well as the distance between U and A or G, is 1 (i.e. maximum).
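
These nucleotide distances follow directly from the digit assignment. A minimal sketch, with `norm2` and `DIGIT` as hypothetical helper names:

```python
from fractions import Fraction
from itertools import combinations

def norm2(m: int) -> Fraction:
    """2-adic norm of an integer m."""
    if m == 0:
        return Fraction(0)
    k = 0
    while m % 2 == 0:
        m //= 2
        k += 1
    return Fraction(1, 2**k)

DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}

# 2-adic distance for every pair of distinct nucleotides.
for x, y in combinations("CAUG", 2):
    print(x, y, norm2(DIGIT[x] - DIGIT[y]))
# C A 1
# C U 1/2
# C G 1
# A U 1
# A G 1/2
# U G 1
```

Only the pyrimidine pair (C, U) and the purine pair (A, G) are at distance 1/2; every purine-pyrimidine pair is at distance 1.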

With respect to the 2-adic distance, the above quadruplets may be regarded as
composed of two doublets: $a = a_0 a_1 1$ and $b = a_0 a_1 3$ make the first doublet, and
$c = a_0 a_1 2$ and $d = a_0 a_1 4$ form the second one. The 2-adic distance between codons
within each of these doublets is $\frac{1}{2}$, i.e.

$$d_2(a, b) = |(3 - 1)\, 5^2|_2 = \frac{1}{2}, \qquad d_2(c, d) = |(4 - 2)\, 5^2|_2 = \frac{1}{2}, \tag{12}$$

because $3 - 1 = 4 - 2 = 2$.
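The 2-adic statements above, both for single nucleotides and for the doublets of equation (12), can be reproduced with a short sketch. The assignment C = 1, A = 2, U = 3, G = 4 is the one used in this paper; the function names and the choice of the quadruplet with prefix 13 are illustrative assumptions.

```python
from fractions import Fraction

def padic_norm(n: int, p: int) -> Fraction:
    """p-adic norm |n|_p of an integer n, with |0|_p = 0."""
    if n == 0:
        return Fraction(0)
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return Fraction(1, p ** k)

DIGIT = {"C": 1, "A": 2, "U": 3, "G": 4}

def d2_nucleotides(x: str, y: str) -> Fraction:
    """2-adic distance between two nucleotides."""
    return padic_norm(DIGIT[x] - DIGIT[y], 2)

def d2_codons(a: str, b: str) -> Fraction:
    """2-adic distance between codons given as digit strings 'a0a1a2'."""
    value = lambda c: sum(int(d) * 5 ** i for i, d in enumerate(c))
    return padic_norm(value(a) - value(b), 2)

# Inside the pyrimidines (C, U) and inside the purines (A, G).
within_groups = {d2_nucleotides("C", "U"), d2_nucleotides("A", "G")}
# Between a pyrimidine and a purine.
across_groups = {d2_nucleotides(x, y) for x in "CU" for y in "AG"}

# The two doublets of the quadruplet with (arbitrarily chosen) prefix 1 3.
doublets = [("131", "133"), ("132", "134")]
doublet_distances = [d2_codons(a, b) for a, b in doublets]

print(within_groups, across_groups, doublet_distances)
```

With these definitions `within_groups` contains only 1/2, `across_groups` contains only 1, and both doublet distances equal 1/2, in agreement with (12).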

One can now look at Table 3 as a system of 32 doublets. Thus the 64 codons are
clustered in a very regular way into 32 doublets. Each of the 21 subjects (20 amino
acids and 1 termination signal) is coded by one, two or three doublets. In fact,
there are two, six and twelve amino acids coded by three, two and one doublet,
respectively. The remaining two doublets code the termination signal.

Note that 2 of 16 doublets code 2 amino acids (Ser and Leu) which are already 
coded by 2 quadruplets, thus amino acids Serine and Leucine are coded by 6 
codons (3 doublets). 

To have a more complete picture of the genetic code it is useful to consider
possible distances between codons of different quadruplets as well as between
different doublets. We also introduce a distance between quadruplets or between
doublets, in particular when the distances between their codons have the same value.
Thus the 5-adic distance between any two quadruplets in the same column is 1/5,
while this distance between other quadruplets is 1. The 5-adic distance between
doublets coincides with the distance between quadruplets, and this distance is
1/25 when the doublets are within the same quadruplet.
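The three 5-adic values quoted above (1/25, 1/5 and 1) can be verified exhaustively over all pairs of codons. This is a sketch under one stated assumption: "the same column" is read here as "the same first digit $a_0$", which matches the column layout of Table 7; the labels in the dictionary are illustrative.

```python
from fractions import Fraction
from itertools import product

def padic_norm(n: int, p: int) -> Fraction:
    """p-adic norm |n|_p of an integer n, with |0|_p = 0."""
    if n == 0:
        return Fraction(0)
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return Fraction(1, p ** k)

def value(codon) -> int:
    """Integer a0 + a1*5 + a2*25 for a codon given as a digit tuple (a0, a1, a2)."""
    return sum(d * 5 ** i for i, d in enumerate(codon))

codons = list(product(range(1, 5), repeat=3))
observed = {
    "same quadruplet": set(),
    "same first digit": set(),
    "different first digit": set(),
}
for a, b in product(codons, repeat=2):
    if a == b:
        continue
    if a[:2] == b[:2]:
        relation = "same quadruplet"
    elif a[0] == b[0]:
        relation = "same first digit"   # assumed meaning of "same column"
    else:
        relation = "different first digit"
    observed[relation].add(padic_norm(value(a) - value(b), 5))

for relation, values in observed.items():
    print(relation, sorted(values))
```

Each relation yields exactly one value, so the distance between two quadruplets (or two doublets) is well defined in the sense described in the text.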

The 2-adic distances between codons, doublets and quadruplets are more
complex. There are three basic cases:

• codons differ in only one digit,

• codons differ in two digits,

• codons differ in all three digits.

In the first case, the 2-adic distance is $\frac{1}{2}$ or 1, depending on whether the
difference between the digits is 2 or not, respectively.

Let us now look at 2-adic distances between doublets coding leucine and also 
between doublets coding serine. These are two cases of amino acids coded by 
three doublets. One has the following distances: 

• $d_2(332, 334) = d_2(132, 134) = \frac{1}{2}$ for leucine,

• $d_2(311, 241) = d_2(313, 243) = \frac{1}{2}$ for serine.
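The three cases and the leucine and serine distances can be explored with a small enumeration. This is only a sketch: the grouping by number of differing digits follows the list above, while the helper names are assumptions introduced for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

def padic_norm(n: int, p: int) -> Fraction:
    """p-adic norm |n|_p of an integer n, with |0|_p = 0."""
    if n == 0:
        return Fraction(0)
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return Fraction(1, p ** k)

def value(codon) -> int:
    """Integer a0 + a1*5 + a2*25 for a digit tuple (a0, a1, a2)."""
    return sum(d * 5 ** i for i, d in enumerate(codon))

def d2(a: str, b: str) -> Fraction:
    """2-adic distance between codons written as digit strings."""
    digits = lambda s: tuple(int(ch) for ch in s)
    return padic_norm(value(digits(a)) - value(digits(b)), 2)

# Collect the 2-adic distances that occur when 1, 2 or 3 digits differ.
codons = list(product(range(1, 5), repeat=3))
by_case = {1: set(), 2: set(), 3: set()}
for a, b in combinations(codons, 2):
    differing = sum(x != y for x, y in zip(a, b))
    by_case[differing].add(padic_norm(value(a) - value(b), 2))

for case, values in by_case.items():
    print(case, sorted(values))

# The leucine and serine distances listed above.
quoted = [d2("332", "334"), d2("132", "134"), d2("311", "241"), d2("313", "243")]
print(quoted)
```

When only one digit differs, the enumeration gives exactly the two values 1/2 and 1; the sets printed for two and three differing digits show how much richer the remaining cases are.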

If we used the usual distance between codons instead of the p-adic one, we would
observe that two synonymous codons are very far apart (at least 25 units), and that
those which are close code different amino acids. Thus we conclude that it is not
the usual metric but an ultrametric that is inherent to codons.
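The contrast with the usual (absolute value) distance can be made concrete. The sketch below only uses the integer values of the codons; it checks how far apart codons of one quadruplet are, and where the nearest neighbours in the usual metric lie. The variable names are illustrative.

```python
from itertools import combinations, product

def value(codon) -> int:
    """Integer a0 + a1*5 + a2*25 for a digit tuple (a0, a1, a2)."""
    return sum(d * 5 ** i for i, d in enumerate(codon))

codons = list(product(range(1, 5), repeat=3))

# Smallest usual distance between two codons of the same quadruplet
# (same first two digits a0, a1).
min_within = min(
    abs(value(a) - value(b))
    for a, b in combinations(codons, 2)
    if a[:2] == b[:2]
)

# Pairs that are nearest neighbours in the usual metric (distance 1).
neighbours = [
    (a, b) for a, b in combinations(codons, 2)
    if abs(value(a) - value(b)) == 1
]
all_cross = all(a[:2] != b[:2] for a, b in neighbours)

print(min_within)               # 25
print(len(neighbours), all_cross)
```

Codons of the same quadruplet are never closer than 25 units in the usual metric, whereas every pair at usual distance 1 differs in the first digit and therefore belongs to two different quadruplets.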

How is the degeneracy of the genetic code connected with p-adic distances between
codons? The answer is in the following p-adic degeneracy principle: Two
codons have the same meaning with respect to amino acids if they are at the smallest
5-adic distance and at 2-adic distance $\frac{1}{2}$. Here the p-adic distance plays the role
of similarity: the closer, the more similar. Taking into account all known codes
(see the next subsection), there is a slight violation of this principle. It is worth
noting that in modern particle physics it is precisely the breaking of the fundamental
gauge symmetry that gives the Standard Model. It makes sense to introduce a new
principle (let us call it the reality principle): Reality is the realization of some
broken fundamental principles. It seems that this principle is valid not only in
physics but also in all sciences. In this context the modern genetic code is an
evolutionary breaking of the above p-adic degeneracy principle.
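Read literally, the degeneracy principle defines a partition of the codon space. A minimal sketch, assuming the principle is applied exactly as stated (5-adic distance 1/25 together with 2-adic distance 1/2) and ignoring the known violations, groups the 64 codons as follows; `same_meaning` and `classes` are hypothetical names.

```python
from fractions import Fraction
from itertools import product

def padic_norm(n: int, p: int) -> Fraction:
    """p-adic norm |n|_p of an integer n, with |0|_p = 0."""
    if n == 0:
        return Fraction(0)
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return Fraction(1, p ** k)

def value(codon) -> int:
    """Integer a0 + a1*5 + a2*25 for a digit tuple (a0, a1, a2)."""
    return sum(d * 5 ** i for i, d in enumerate(codon))

def same_meaning(a, b) -> bool:
    """Degeneracy principle: smallest 5-adic distance and 2-adic distance 1/2."""
    diff = value(a) - value(b)
    return (padic_norm(diff, 5) == Fraction(1, 25)
            and padic_norm(diff, 2) == Fraction(1, 2))

classes = []
for codon in product(range(1, 5), repeat=3):
    for cls in classes:
        if same_meaning(codon, cls[0]):
            cls.append(codon)
            break
    else:
        classes.append([codon])   # no matching class found: start a new one

print(len(classes), {len(cls) for cls in classes})  # 32 {2}
```

The result is 32 classes of two codons each, i.e. the 32 doublets discussed above; quadruplets and sextuplets then arise when several doublets carry the same amino acid.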

3.2 Evolution of the genetic code 

The origin and early evolution of the genetic code are among the most interesting
and important problems related to the origin and the whole evolution of life.
However, since there are no concrete facts from that early period, this gives rise
to many speculations. Nevertheless, one can hope that some of the hypotheses may
be tested by looking for their traces in contemporary genomes.

It seems natural to consider biological evolution as an adaptive development 
of simpler living systems to more complex ones. Namely, living organisms are 
open systems in permanent interaction with environment. Thus the evolution 
can be modelled by a system with given initial conditions and guided by some
internal rules taking into account environmental factors. 

We now put forward a conjecture on the evolution of the genetic code using our
p-adic approach to the genomic space, assuming that preceding codes used
simpler codons and older amino acids.

Recall that the p-adic genomic space $\Gamma_p[(p-1)^m]$ has two parameters: $p$, related
to the $p - 1$ building blocks, and $m$, the multiplicity of the building blocks in
the space elements.

• Case $\Gamma_2[1]$ is a trivial one and useless for a primitive code.

• Case $\Gamma_3[2^m]$ with $m = 1, 2, 3$ does not seem to be realistic.

• Case $\Gamma_5[4^m]$ with $m = 1, 2, 3$ offers a possible pattern to consider evolution
of the genetic code. Namely, the codon space could evolve in the following
way: $\Gamma_5[4] \to \Gamma_5[4^2] \to \Gamma_5[4^3] = C_5[64]$ (see also Table 7).
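The sizes of the spaces in the three cases above follow directly from the definition: a word in $\Gamma_p[(p-1)^m]$ has $m$ positions, each filled by one of the $p - 1$ nonzero digits. The following sketch enumerates them; the function names are illustrative and the nucleotide labels use the assignment C = 1, A = 2, U = 3, G = 4.

```python
from itertools import product

NUCLEOTIDE = {1: "C", 2: "A", 3: "U", 4: "G"}

def genomic_space(p: int, m: int) -> list:
    """All m-digit words over the nonzero base-p digits 1, ..., p-1."""
    return list(product(range(1, p), repeat=m))

def to_nucleotides(word) -> str:
    """Spell a word over the digits 1..4 with the letters C, A, U, G."""
    return "".join(NUCLEOTIDE[d] for d in word)

# Gamma_5[4] -> Gamma_5[4^2] -> Gamma_5[4^3] = C_5[64]
sizes = {m: len(genomic_space(5, m)) for m in (1, 2, 3)}
print(sizes)                                                # {1: 4, 2: 16, 3: 64}
print([to_nucleotides(w) for w in genomic_space(5, 1)])     # ['C', 'A', 'U', 'G']

# The two cases set aside in the text: p = 2 and p = 3.
print(len(genomic_space(2, 1)))                             # 1
print([len(genomic_space(3, m)) for m in (1, 2, 3)])        # [2, 4, 8]
```

For $p = 3$ even $m = 3$ gives only 8 words, fewer than the 20 standard amino acids, whereas $p = 5$ reaches 64.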

According to Table 3 this primary code, containing codons in the single
nucleotide form (C, A, U, G), encoded the first four amino acids: Gly, Ala, Asp and
Val. From the last column of Table 3 we conclude that the connection between
digits and amino acids is: 1 = Ala, 2 = Asp, 3 = Val, 4 = Gly. In the primary
code these digits occupied the first position in the 5-adic expansion (Table 3),
and at the next step, i.e. $\Gamma_5[4] \to \Gamma_5[4^2]$, they moved to the second position,
with the digits 1, 2, 3, 4 added in front of each of them.

In $\Gamma_5[4^2]$ one has 16 dinucleotide codons, which can code up to 16 new amino
acids. Adding the digit 4 in front of the already existing codons 1, 2, 3, 4 leaves
their meaning unchanged, i.e. 41 = Ala, 42 = Asp, 43 = Val, and 44 = Gly.
Adding the digits 3, 2, 1 in front of the primary codons 1, 2, 3, 4, one obtains 12
possibilities for coding some new amino acids. To decide which amino acid was encoded
by which of the 12 dinucleotide codons, we use as a criterion their immutability in
the trinucleotide coding on the $\Gamma_5[4^3]$ space. This criterion assumes that amino
acids encoded earlier are more fixed than those encoded later. According to this
criterion we decide in favor of the first row in each rectangle of Table 3, and the
result is presented in Table 8.
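The bookkeeping of this step can be checked against the dinucleotide code of Table 8. The sketch below simply transcribes the primary assignment and Table 8 into dictionaries (the variable names are illustrative) and verifies the statements made in the text: the four primary meanings survive with the digit 4 in front, 12 codons are new, and attaching a third digit yields 64 trinucleotide codons.

```python
# Primary single-nucleotide code, as stated in the text.
PRIMARY = {"1": "Ala", "2": "Asp", "3": "Val", "4": "Gly"}

# Dinucleotide code transcribed from Table 8 (digits a0 a1).
DINUCLEOTIDE = {
    "11": "Pro", "21": "Thr", "31": "Ser", "41": "Ala",
    "12": "His", "22": "Asn", "32": "Tyr", "42": "Asp",
    "13": "Leu", "23": "Ile", "33": "Phe", "43": "Val",
    "14": "Arg", "24": "Ser", "34": "Cys", "44": "Gly",
}

# Digit 4 in front of a primary codon leaves its meaning unchanged.
preserved = all(DINUCLEOTIDE["4" + d] == aa for d, aa in PRIMARY.items())

# Codons obtained with the digits 1, 2, 3 in front are the new ones.
new_codons = [c for c in DINUCLEOTIDE if c[0] != "4"]

# Distinct amino acids in the dinucleotide code (serine occurs twice).
distinct = set(DINUCLEOTIDE.values())

# Next step: attach a digit 1..4 at the third position of every dinucleotide.
trinucleotides = [c + d for c in DINUCLEOTIDE for d in "1234"]

print(preserved, len(new_codons), len(distinct), len(trinucleotides))
# True 12 15 64
```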

The transition from dinucleotide to trinucleotide codons occurred by attaching
the nucleotides 1, 2, 3, 4 at the third position, i.e. behind each dinucleotide. In this
way one obtains the new codon space $\Gamma_5[4^3] = C_5[64]$, which is significantly enlarged
and provides a pattern to generate the known genetic codes. This codon space $C_5[64]$
makes it possible to realize at least three general properties of the modern code:

(i) encoding of more than 16 amino acids,

(ii) diversity of codes,

(iii) stability of the gene expression.

Let us give some relevant clarifications.

(i) For the functioning of contemporary living organisms it is necessary to code at
least 20 standard (Table 3) and 2 non-standard amino acids (selenocysteine and
pyrrolysine). Probably these 22 amino acids are also sufficient building units for
the biosynthesis of all necessary contemporary proteins. While $\Gamma_5[4^2]$ is insufficient,
the genomic space $\Gamma_5[4^3]$ offers approximately three codons per amino acid.

(ii) The eukaryotic (often called standard or universal) code was established
around 1966 and was thought to be universal, i.e. common to all organisms.
When the vertebrate mitochondrial code was discovered in 1979, it gave rise to
the belief that the code is not frozen and that there are also other, mutually
different codes. According to later evidence, one can say that there are
at least 16 slightly different mitochondrial and nuclear codes (for a review, see
[26], [27] and references therein). Different codes have some codons with different
meanings. Thus, in the standard code there are the following changes relative to Table 3:

• 232 (AUA): Met $\to$ Ile,

• 242 (AGA) and 244 (AGG): Ter $\to$ Arg,

• 342 (UGA): Trp $\to$ Ter.

Modifications in these 16 codes are not homogeneously distributed over the 16 rectangles
of Table 3. For instance, in all 16 codes the codons $41i$ ($i = 1, 2, 3, 4$) have the same
meaning.

(iii) Each of the 16 codes is degenerate, and this degeneracy provides their
stability against possible mutations. In other words, degeneracy helps to minimize
codon errors.

Genetic codes based on single nucleotide and dinucleotide codons were mainly
directed to code amino acids with rather different properties. This may be the
reason why the amino acids Glu and Gln are not coded in the dinucleotide code
(Table 8), since they are similar to Asp and Asn, respectively. However, to become
almost optimal, trinucleotide codes have taken into account structural and
functional similarities of amino acids.

We have presented here a hypothesis on the evolution of the genetic code, taking into
account a possible evolution of codons, from 1-nucleotide to 3-nucleotide, and the
temporal appearance of amino acids. This scenario may be extended to cell evolution,
which probably should be considered as a coevolution of all its main ingredients
(for an early idea of the coevolution, see [28]).
TABLE 7. 5-Adic system including digit 0, and containing single nucleotide,
dinucleotide and trinucleotide codons. If one ignores numbers which contain digit
0 in front of any 1, 2, 3 or 4, then one has a one-to-one correspondence between
1-digit, 2-digit, 3-digit numbers and single nucleotides, dinucleotides,
trinucleotides, respectively. It seems that the evolution of codons has followed the
transitions: single nucleotides $\to$ dinucleotides $\to$ trinucleotides.



000        100 C      200 A      300 U      400 G
010        110 CC     210 AC     310 UC     410 GC
020        120 CA     220 AA     320 UA     420 GA
030        130 CU     230 AU     330 UU     430 GU
040        140 CG     240 AG     340 UG     440 GG

001        101        201        301        401
011        111 CCC    211 ACC    311 UCC    411 GCC
021        121 CAC    221 AAC    321 UAC    421 GAC
031        131 CUC    231 AUC    331 UUC    431 GUC
041        141 CGC    241 AGC    341 UGC    441 GGC

002        102        202        302        402
012        112 CCA    212 ACA    312 UCA    412 GCA
022        122 CAA    222 AAA    322 UAA    422 GAA
032        132 CUA    232 AUA    332 UUA    432 GUA
042        142 CGA    242 AGA    342 UGA    442 GGA

003        103        203        303        403
013        113 CCU    213 ACU    313 UCU    413 GCU
023        123 CAU    223 AAU    323 UAU    423 GAU
033        133 CUU    233 AUU    333 UUU    433 GUU
043        143 CGU    243 AGU    343 UGU    443 GGU

004        104        204        304        404
014        114 CCG    214 ACG    314 UCG    414 GCG
024        124 CAG    224 AAG    324 UAG    424 GAG
034        134 CUG    234 AUG    334 UUG    434 GUG
044        144 CGG    244 AGG    344 UGG    444 GGG
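The labelling rule of Table 7 can be stated as a small decoding function. This is a sketch of one possible reading of the caption: trailing zeros are treated as padding of shorter words, and any numeral in which a 0 precedes a nonzero digit is left unlabelled; the name `decode` is hypothetical.

```python
from itertools import product
from typing import Optional

NUCLEOTIDE = {"1": "C", "2": "A", "3": "U", "4": "G"}

def decode(numeral: str) -> Optional[str]:
    """Nucleotide word for a 3-digit 5-adic numeral 'a0a1a2', or None if unlabelled."""
    word = numeral.rstrip("0")      # trailing zeros only pad shorter words
    if not word or "0" in word:     # 000, or a 0 in front of a nonzero digit
        return None
    return "".join(NUCLEOTIDE[d] for d in word)

table = {"".join(n): decode("".join(n)) for n in product("01234", repeat=3)}

# Number of labelled entries by word length.
counts = {
    k: sum(1 for w in table.values() if w is not None and len(w) == k)
    for k in (1, 2, 3)
}
print(decode("100"), decode("130"), decode("131"), decode("010"))  # C CU CUC None
print(counts)                                                      # {1: 4, 2: 16, 3: 64}
```

Of the 125 numerals, 84 are labelled: 4 single nucleotides, 16 dinucleotides and 64 trinucleotides.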
TABLE 8. The dinucleotide genetic code based on the p-adic genomic space
$\Gamma_5[4^2]$. Note that it encodes 15 amino acids without a stop codon, but encodes
serine twice.



11 Pro     21 Thr     31 Ser     41 Ala
12 His     22 Asn     32 Tyr     42 Asp
13 Leu     23 Ile     33 Phe     43 Val
14 Arg     24 Ser     34 Cys     44 Gly



4 Concluding Remarks 

There are two aspects of the genetic code related to: 

(i) multiplicity of codons which code the same amino acid, 

(ii) concrete assignment of codon multiplets to particular amino acids. 

The p-adic approach presented above gives a quite satisfactory description of
aspect (i). The ultrametric behavior of p-adic distances between elements of the
$C_5[64]$ codon space differs radically from that of the usual distances. Quadruplets
and doublets of codons have a natural explanation in terms of 5-adic and 2-adic
nearness. Degeneracy of the genetic code in the form of doublets, quadruplets and
sextuplets is a direct consequence of p-adic ultrametricity between codons. The p-adic
$C_5[64]$ codon space is our theoretical pattern for considering all variants of the
genetic code: some codes are direct representations of $C_5[64]$ and the others are
its slight evolutionary modifications.

(ii) Which amino acid corresponds to which multiplet of codons? An answer
to this question should be expected from connections between physicochemical
properties of amino acids and anticodons. Namely, the enzyme aminoacyl-tRNA
synthetase links a specific tRNA anticodon with the related amino acid. Thus there is
no direct interaction between amino acids and codons, as was believed for some
time in the past.

Note that there are in general 4! ways to assign the digits 1, 2, 3, 4 to the nucleotides
C, A, U, G. After an analysis of all 24 possibilities, we have taken C = 1, A = 2,
U = T = 3, G = 4 as a quite appropriate choice. In addition to the various properties
already presented in this paper, it also exhibits complementarity of nucleotides in
the DNA double helix through the relation C + G = A + T = 5.
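The relation C + G = A + T = 5 means that the complement of a nucleotide with digit $x$ is the nucleotide with digit $5 - x$. A minimal sketch of this rule (the function name `complement` is illustrative, and the strand is complemented base by base without reversing its direction):

```python
DIGIT = {"C": 1, "A": 2, "T": 3, "G": 4}
BASE = {digit: base for base, digit in DIGIT.items()}

def complement(strand: str) -> str:
    """Base-by-base complement of a DNA strand using x -> 5 - x on the digits."""
    return "".join(BASE[5 - DIGIT[base]] for base in strand)

print(complement("CATG"))  # GTAC
```

Applying the map twice returns the original strand, since $5 - (5 - x) = x$.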

It would be useful to find an analogous connection between the 22 amino acids
and the digits 1, 2, ..., 22 in the p = 23 representation. Now there are 22! possibilities,
and exploring all of them seems to be a hard task. However, the use of computer
analysis may help to find a satisfactory solution.
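To give a sense of scale for the two assignment problems, the number of possible assignments can be computed directly. This is only an arithmetic illustration; the variable names are hypothetical.

```python
import math

nucleotide_assignments = math.factorial(4)    # digits 1..4 to C, A, U, G
amino_acid_assignments = math.factorial(22)   # digits 1..22 to 22 amino acids

print(nucleotide_assignments)   # 24
print(amino_acid_assignments)   # 1124000727777607680000
```

The second number is of order $10^{21}$, so an exhaustive search in the style of the 24 nucleotide assignments is not feasible without additional constraints that reduce the search space.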

One can express many of the above considerations of p-adic information theory in
linguistic terms and investigate possible linguistic applications.

In this paper we have employed p-adic distances to measure similarity between
codons, which have been used to describe degeneracy of the genetic code. It is
worth noting that in other contexts p-adic distances can be interpreted in quite
different ways. For example, the 3-adic distance between cytosine and guanine is
$d_3(1, 4) = \frac{1}{3}$, and between adenine and thymine $d_3(2, 3) = 1$. It seems
natural to relate this 3-adic distance to hydrogen bonds between complements in the DNA
double helix: the smaller the distance, the stronger the hydrogen bonding. Recall that C-G
and A-T are bonded by 3 and 2 hydrogen bonds, respectively.
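For completeness, the two 3-adic values follow from the same definition of the p-adic distance used for codons, with C = 1, A = 2, T = 3, G = 4:

```latex
\begin{align*}
d_3(1, 4) &= |4 - 1|_3 = |3|_3 = \tfrac{1}{3}, \\
d_3(2, 3) &= |3 - 2|_3 = |1|_3 = 1 .
\end{align*}
```

The C-G difference is divisible by 3 once, while the A-T difference is not divisible by 3 at all, which is why the first distance is the smaller one.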

The translation of codon sequences into proteins is very much an
information-processing phenomenon. The p-adic information modelling presented in this
paper offers a new approach to the systematic investigation of ultrametric aspects of DNA
and RNA sequences, the genetic code and the world of proteins. It can be
embedded in computer programs to explore the p-adic side of the genome and related
subjects.

The above considerations and obtained results may be regarded as contri- 
butions mainly towards foundations of (i) p-adic theory of information and (ii) 
p-adic theory of the genetic code. 

Summarizing, contributions to 

(i) p-adic theory of information contain: 

• formulation of p-adic genomic space (whose examples are spaces of nu- 
cleotides, dinucleotides and trinucleotides), 

• formulation of p-adic bioinformation space (whose examples are DNA, RNA 
and protein spaces), 

• relation between building blocks of information spaces and some prime num- 
bers; 

(ii) p-adic theory of the genetic code include: 

• description of codon quadruplets and doublets by 5-adic and 2-adic dis- 
tances, 

• observation of a symmetry between quadruplets as well as between doublets 
at our table of codons, 

• formulation of the degeneracy principle,

• formulation of a hypothesis on codon evolution.

Many problems remain to be explored in the future on the above p-adic ap- 
proach to genomics. Among the most attractive and important themes are: 

• elaboration of the p-adic theory of information towards genomics, 

• evolution of the genome and the genetic code, 

• structure and function of non-coding DNA, 

• ultrametric aspects of proteins, 

• creation of the corresponding computer programs. 